Loading the required packages
library(tidyverse) # also attaches dplyr and ggplot2
Import processed data, which can be found here.
#read preprocessed data
wines <- read.csv(file = '../data/processed_data/wines.csv')
Get a sample of the dataset
#set seed value to birthday of Ricardo Rodriguez, American wrestler and ring announcer and Dr. Reinaldo (Rei) Sanchez-Arias
set.seed(19630217)
#set percentage to test with for simplicity, if needed
percentage <- 5
wine_sample <- sample_n(wines, percentage/100 * nrow(wines))
To make our dataframes more manageable, we split redundant information about the tasters into a new dataframe.
tasters <- wines %>%
select(taster_name, taster_twitter_handle) %>%
unique()
tasters
Drop taster_twitter_handle from the wines dataframe
wines <- wines %>%
select(-taster_twitter_handle)
head(wines)
Each reviewer has their own bias. To offset that, we built a "profile" for each reviewer. This profile allows us to later normalize the wine points for a more robust apples-to-apples comparison. Each reviewer's profile includes additional characteristics:

- avg_points: the average of all the reviewer's scores
- sd_points: the standard deviation of all the reviewer's scores
- var_points: the variance of all the reviewer's scores
- reviews: the number of reviews conducted
taster_rating_profile <- wines %>%
group_by(taster_name) %>%
summarize(
avg_points = mean(points),
sd_points = sd(points),
var_points = var(points),
reviews = n()
)
tasters <- inner_join(tasters, taster_rating_profile, by = "taster_name")
head(tasters)
Wines are normally classified into categories, as found on the website. To create a richer dataset, we added the field rating_category, determined as:
| Category | Rating | Description |
|---|---|---|
| Classic | 98-100 | The pinnacle of quality. |
| Superb | 94-97 | A great achievement. |
| Excellent | 90-93 | Highly recommended. |
| Very Good | 87-89 | Often good value; well recommended. |
| Good | 83-86 | Suitable for everyday consumption; often good value. |
| Acceptable | 80-82 | Can be employed in casual, less-critical circumstances. |
# function to map a point score to its rating category
rating_category <- function(points) {
  if (points >= 98) {
    return("Classic")
  } else if (points >= 94) {
    return("Superb")
  } else if (points >= 90) {
    return("Excellent")
  } else if (points >= 87) {
    return("Very Good")
  } else if (points >= 83) {
    return("Good")
  } else {
    return("Acceptable")
  }
}
wines <- wines %>%
rowwise() %>%
mutate(rating_category = rating_category(points))
head(wines)
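As an aside, the per-row if/else above can be vectorized with dplyr::case_when, which makes the rowwise() call unnecessary. A minimal sketch (rating_category_vec is a hypothetical name, not part of the original analysis):

```r
library(dplyr)

# Vectorized equivalent of rating_category(): case_when checks the
# conditions in order, so each score lands in its first matching band.
rating_category_vec <- function(points) {
  case_when(
    points >= 98 ~ "Classic",
    points >= 94 ~ "Superb",
    points >= 90 ~ "Excellent",
    points >= 87 ~ "Very Good",
    points >= 83 ~ "Good",
    TRUE         ~ "Acceptable"
  )
}

rating_category_vec(c(100, 95, 88, 80))
# "Classic" "Superb" "Very Good" "Acceptable"
```

With this version, `mutate(rating_category = rating_category_vec(points))` operates on the whole column at once, no `rowwise()` required.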
Since each reviewer has a different bias, we created a normalized metric, norm_points, by looking at the number of standard deviations a wine is from the reviewer's avg_points. This gives us a more accurate representation of which wines are "better" than the rest.
normalize_points <- function(data){
ungroup(data) %>% # drop the rowwise grouping left over from the earlier mutate
left_join(tasters, by = "taster_name") %>%
mutate(norm_points = (points - avg_points)/sd_points) %>% # vectorized; no rowwise() needed
select(-avg_points, -sd_points, -var_points, -taster_twitter_handle, -reviews)
}
wines <- normalize_points(wines)
head(wines)
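A quick way to convince ourselves the z-scoring works: after normalizing within each taster, the scores should average roughly 0 with a standard deviation near 1. A self-contained sketch on toy data (the tasters and scores here are made up for illustration):

```r
library(dplyr)

# Hypothetical scores from two tasters with different habits.
toy <- tibble(
  taster_name = c("A", "A", "A", "B", "B", "B"),
  points      = c(90, 95, 100, 85, 87, 89)
)

# Same normalization as normalize_points(): deviation from the
# taster's mean, measured in that taster's standard deviations.
toy_norm <- toy %>%
  group_by(taster_name) %>%
  mutate(norm_points = (points - mean(points)) / sd(points)) %>%
  ungroup()

chk <- toy_norm %>%
  group_by(taster_name) %>%
  summarize(mean_norm = mean(norm_points), sd_norm = sd(norm_points))
chk
# each taster's mean_norm is 0 and sd_norm is 1
```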
Vintage contains an obviously erroneous year (7200), so we kept only vintages before 2019.
wines <- wines %>%
filter(vintage<2019)
Before conducting any detailed analysis, we looked at a quick summary of the dataset.
summary(wines)
title
Château Lestage Simon 2012 Haut-Médoc : 4
Château Maine-Gazin 1997 Bordeaux : 4
Concannon 2009 Selected Vineyards, Central Coast Pinot Noir (Livermore Valley): 4
Domaine Vacheron 2015 Sancerre : 4
Vignerons des Pierres Dorées 2015 Salamandre d'Or (Coteaux Bourguignons) : 4
Adega Cooperativa de Borba 2009 Senses Syrah (Alentejano) : 3
(Other) :117855
alcohol category vintage designation
Min. : 2.20 Dessert : 1648 Min. :1150 :35285
1st Qu.: 13.00 Fortified : 60 1st Qu.:2009 Reserve : 1924
Median : 13.53 Port/Sherry: 386 Median :2011 Estate : 1236
Mean : 13.84 Red :72354 Mean :2010 Reserva : 1214
3rd Qu.: 14.40 Rose : 4203 3rd Qu.:2013 Riserva : 669
Max. :8333.00 Sparkling : 2273 Max. :2017 Estate Grown: 584
NA's :9060 White :36954 (Other) :76966
country province region
US :50228 California :33418 :19360
France :19164 Washington : 8106 Napa Valley : 4234
Italy :17497 Bordeaux : 5665 Columbia Valley (WA): 3855
Spain : 5890 Tuscany : 5647 Russian River Valley: 2896
Portugal: 4905 Oregon : 4824 California : 2200
Chile : 4235 Northern Spain: 3712 Paso Robles : 2193
(Other) :15959 (Other) :56506 (Other) :83140
subregion variety
:70775 Pinot Noir :12071
Central Coast :10277 Chardonnay :10813
Sonoma : 8527 Cabernet Sauvignon : 8933
Columbia Valley : 7617 Red Blend : 8233
Napa : 6450 Bordeaux-style Red Blend: 6564
Willamette Valley: 3286 Riesling : 4893
(Other) :10946 (Other) :66371
winery price points
DFJ Vinhos : 208 Min. : 4.00 Min. : 80.00
Williams Selyem : 207 1st Qu.: 17.00 1st Qu.: 86.00
Testarossa : 205 Median : 25.00 Median : 88.00
Louis Latour : 193 Mean : 35.48 Mean : 88.47
Chateau Ste. Michelle: 192 3rd Qu.: 42.00 3rd Qu.: 91.00
Georges Duboeuf : 192 Max. :3300.00 Max. :100.00
(Other) :116681 NA's :7974
taster_name rating_category norm_points
:23945 Length:117878 Min. :-4.49580
Roger Voss :22265 Class :character 1st Qu.:-0.72711
Michael Schachner:13867 Mode :character Median : 0.03980
Kerin O’Keefe : 9747 Mean : 0.01183
Virginie Boone : 8926 3rd Qu.: 0.70027
Paul Gregutt : 8747 Max. : 4.46378
(Other) :30381
To better understand the distribution of our data, we did some simple univariate visualizations based on certain fields. Before doing a multivariate analysis and answering our research questions, we first want to ensure our dataset is robust and an accurate representation of the real world.
The graph below depicts the distribution of our dataset by alcohol percentage, alcohol. To better understand and visualize the data, we colored the graph by rating_category. Notice that a majority of wines have an alcohol content between 12-15%; according to Real Simple, wine averages between 11 and 13 percent. This leads us to believe our data is an accurate representation of the real world.
wines %>%
ggplot() +
geom_histogram(
mapping = aes(
x = alcohol,
fill = rating_category),
na.rm = TRUE,
bins = 50) +
scale_x_continuous(
breaks = seq(0,25,1),
limits = c(4,22)) +
labs(
title = "Distribution of Alcohol Percentage",
x = "Alcohol Percentage",
y = "Count",
fill = "Category"
)
Next, we wanted to see what vintage most of the wines in the dataset were. Again, to better understand and visualize the data, we colored the graph by rating_category. Notice that there is roughly an equal percentage of each category per vintage, further enhancing our confidence in the dataset's robustness. (Note: data points before 1990 have been omitted for clarity in visualization.)
wines %>%
ggplot() +
geom_bar(
mapping = aes(
x=vintage,
fill = rating_category),
na.rm = TRUE) +
scale_x_continuous(
breaks = seq(1990,2019,5),
limits = c(1990,2019)) +
labs(
title = "Distribution of Vintage",
x = "Vintage",
y = "Count",
fill = "Category")
To better understand the number of wines per winery, we plotted the count of wines per winery, showing only the top 10 wineries to give you an idea of which winery has the largest selection of wines. Notice that each of the top 10 producers has over 100 different wine labels.
wines %>%
group_by(winery) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
slice(1:10) %>%
ggplot() +
geom_col(
mapping = aes(
x= reorder(winery, count),
y = count,
fill = winery)) +
labs(
title = "Wines per Winery (Top 10)",
x = "Winery",
y = "Count"
) +
theme(legend.position = "none") +
coord_flip()
To better understand the number of wines per province, we plotted the count of wines per province, showing only the top 10 provinces with the most wines. This gives the reader an idea of where their wine will most likely be made, with California standing out as a clear leader.
wines %>%
group_by(province) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
slice(1:10) %>%
ggplot()+
geom_col(
mapping = aes(
x = reorder(province, count),
y = count,
fill = province)) +
labs(
title = "Distribution of Province (Top 10)",
x = "Province",
y = "Count") +
theme(legend.position = "none") +
coord_flip()
Next, we wanted to visualize the distribution of price in our dataset. To better understand and visualize the data, we colored the graph by rating_category. Notice that a majority of wines are $50 and below, with the most common range being $12 to $25. Again, this accurately represents the real world: as stated by Vivino, the average price for a good bottle of red/white wine is ~$15, and ~$28 for a very good bottle. (CAUTION: the Vivino prices are simply averages for red/white wines, used to generalize the information for a more accurate apples-to-apples comparison; they do not cover other types of wine.) (Note: data points above $400 have been omitted for clarity in visualization.)
wines %>%
filter(price < 400) %>%
ggplot() +
geom_histogram(
mapping = aes(
x=price,
fill = rating_category),
binwidth = 15) +
labs(
title = "Distribution of Price",
x = "Price",
y = "Count",
fill = "Category")
Next, we wanted to visualize the distribution of points in our dataset. Notice here that a majority of wines receive a score between 87 and 90, which is consistent with the information provided on ()[]
wines %>%
ggplot() +
geom_histogram(
mapping = aes(x=points),
bins = 20)
To help you understand the point distribution by reviewer, we made a multivariate visualization that compares each taster's score distribution against the overall average wine points, marked by the vertical line. This gives the reader an idea of how individual reviewers compare to the overall average.
wines %>%
ggplot() +
geom_boxplot(aes(y=taster_name, x=points, color = taster_name)) +
geom_vline(xintercept = mean(wines$points)) +
theme(
legend.position = "none")
Notice that the data is "stacked" in vertical bands because points are integer-valued, and the scores range from 80 to 100. Price generally increases with points, but the relationship is far from linear.
wines %>%
ggplot() +
geom_point(mapping = (aes(x = points, y = price, color = rating_category)), na.rm = T, alpha = 0.2) +
labs(title = "Price by Points", x = "Points", y = "Price") +
geom_smooth(mapping = (aes(x = points, y = price))) +
theme(legend.position = "none")
Because price is heavily right-skewed, plotting log(price) against points makes the trend much easier to see: the relationship between points and log(price) is roughly linear.
wines %>%
ggplot() +
geom_point(mapping = (aes(x = points, y = log(price), color = rating_category)), na.rm = T, alpha = 0.15) +
labs(title = "log(Price) by Points", x = "Points", y = "log(Price)") +
geom_smooth(mapping = (aes(x = points, y = log(price))))
Faceting the price-by-points scatter plot by rating_category separates each score band into its own panel, making it easier to compare the spread of prices within each category. (Prices of $1,000 and above are filtered out for clarity.)
wines %>%
filter(price < 1000) %>%
ggplot() +
geom_point(mapping = aes(x = points, y = price, color = rating_category),
na.rm = T) +
facet_wrap(~ rating_category) +
# the original scale_color_fairyfloss() comes from a package not loaded here;
# ggplot2's built-in viridis palette is used instead
scale_color_viridis_d() +
theme_minimal()
To determine the best province for wine by points, we averaged the points of all wines per province with a sample size greater than 30 and returned the top 10 with standard error. The standard error tells us how uncertain each province's average is. The graph below shows the top 10 provinces by average points, with standard error bars.
wines %>%
group_by(province) %>%
summarise(avg_points_prov = mean(points), count = n(), std_points_prov_err = sd(points)/sqrt(count)) %>%
filter(count>30) %>%
arrange(desc(avg_points_prov)) %>%
slice(1:10) %>%
ggplot() +
geom_col(mapping = aes(y=province, x= avg_points_prov)) +
geom_errorbar(
mapping = aes(
y = province,
x = avg_points_prov,
xmin = avg_points_prov - std_points_prov_err,
xmax = avg_points_prov + std_points_prov_err
),
width = 0.2)+
labs(y = 'Province', x = "Average Points", title = "Average Points By Province (Top 10)")
To determine the best variety of wine, we use the average points of all wines per variety with a sample size greater than 30. The graph below shows the top 10 varieties with their respective standard errors.
wines %>%
group_by(variety) %>%
summarise(
avg_points_variety = mean(points),
count = n(),
sd_err_points_variety = sd(points)/sqrt(count)) %>%
filter(count>30) %>%
arrange(desc(avg_points_variety)) %>%
slice(1:10) %>%
ggplot() +
geom_col(mapping = aes(y=variety, x=avg_points_variety))+
geom_errorbar(
mapping = aes(
y = variety,
x = avg_points_variety,
xmin = avg_points_variety - sd_err_points_variety,
xmax = avg_points_variety + sd_err_points_variety
),
width = 0.2
)
Finally, we let the reader supply a budget: the snippet below prompts for a price and lists all wines at or below that price, sorted by points.
user_price <- readline(prompt = "How much are you willing to spend on a bottle?")
user_price <- as.integer(user_price)
wines %>%
filter(price <= user_price) %>%
arrange(desc(points)) %>%
select(title, price, points)
An easy way to determine the best wine is to simply find the top 10 wines by points.
wines %>%
arrange(desc(points)) %>%
slice(1:10)
However, this does not account for each taster's bias. Instead, our group "normalized" the points per taster, based on the number of standard deviations a wine is from that taster's average. For example, Taster A could give a wine 100 but have an average score of 95 with a standard deviation of 5, whereas Taster B could give a wine 91 while having an average score of 87 with a standard deviation of 2. Although the wine tasted by Taster A got a perfect score of 100, Taster B's wine was the "better" wine, since it was 2 standard deviations above its taster's average compared to 1 standard deviation for the other wine.
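The Taster A/B example works out as follows:

```r
# z-score = (score - taster's average) / taster's standard deviation
z_a <- (100 - 95) / 5   # Taster A: 1 sd above their average
z_b <- (91 - 87) / 2    # Taster B: 2 sds above their average
c(z_a, z_b)
# 1 2
```

So Taster B's wine ranks higher under norm_points despite its lower raw score.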
Looking at norm_points, these are the top 10 best wines:
wines %>%
arrange(desc(norm_points)) %>%
slice(1:10)
A simple metric we can use to determine the best value is points/price.
wines %>%
arrange(desc(points/price)) %>%
slice(1:10)
However, this metric is again not normalized; norm_points/price yields more robust results.
wines %>%
arrange(desc(norm_points/price)) %>%
slice(1:10)